Performance comparison of word filtering functions Performance comparison of word filtering functions (Part 2)

Performance comparison of word filtering functions

Introduction

For one of my programs, I need to filter and remove some words from a long text. I need to remove pronouns, conjugated verbs, determiners and some other words. other words.

The first way to do this would be to simply filter the word list wrdstxt by testing the absence of each word in the wrdsfilt list using notElem:

filterByList (wrdsfilt, wrdstxt) = filter (`notElem` wrdsfilt) wrdstxt

For a small number of words used as a filter, this function would be suitable, but as the number of words to be deleted increase, the processing time will increase too¹.

So we have to find a way to optimize the process.

The solution would be to store the words not in a list but in a search tree. The words sorted and organized in this way would be faster to search.

After searching for documentation on the subject, I found several solutions to answer my problem.

But what is the most efficient solution?

That's what we will see in this benchmark!

Methodology

Material

This test will be performed on a machine running Debian 10 Linux with an AMD Ryzen 7 3700X 8 core 16 thread processor with 32GB of DDR4 memory.

Generality

The benchmark will be done with Haskell and Criterion with all compiler optimizations enabled ghc -o2 -fforce-recomp.

The list of words to be filtered will be this one: Mots_a_filtrer.txt

The text to scan will be Travel to the center of the earth retrieved from the Gutenberg site

The words will be loaded before the test, in order not to distort the results.

The search trees will be created during the test in order to take into account the time needed to create the tree.

Solutions

The different solutions reviewed are :

Data.Set

The container package provides various standard modules for storing data and in particular the Set.

This structure is a binary search tree (2 branches) automatically balanced when it is creation.

Multibranch custom tree (WordTree1)

This tree is a tree of my own design containing a branch for each letter of the alphabet. The order of the trees corresponds to the alphabetical order: abcdefghijklmnopqrstuvwxyz

This tree can't be rebalanced by itself.

The member function to test the presence of an element is implemented in this way:

member v0 tree = go v0 tree
    where
        go _  Tip = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24 t25 t26) | a == 'a'  = go as t1
                                                                                                                             | a == 'b'  = go as t2
                                                                                                                             | a == 'c'  = go as t3
                                                                                                                             | a == 'd'  = go as t4
                                                                                                                             | a == 'e'  = go as t5
                                                                                                                             | a == 'f'  = go as t6
                                                                                                                             | a == 'g'  = go as t7
                                                                                                                             | a == 'h'  = go as t8
                                                                                                                             | a == 'i'  = go as t9
                                                                                                                             | a == 'j'  = go as t10
                                                                                                                             | a == 'k'  = go as t11
                                                                                                                             | a == 'l'  = go as t12
                                                                                                                             | a == 'm'  = go as t13
                                                                                                                             | a == 'n'  = go as t14
                                                                                                                             | a == 'o'  = go as t15
                                                                                                                             | a == 'p'  = go as t16
                                                                                                                             | a == 'q'  = go as t17
                                                                                                                             | a == 'r'  = go as t18
                                                                                                                             | a == 's'  = go as t19
                                                                                                                             | a == 't'  = go as t20
                                                                                                                             | a == 'u'  = go as t21
                                                                                                                             | a == 'v'  = go as t22
                                                                                                                             | a == 'w'  = go as t23
                                                                                                                             | a == 'x'  = go as t24
                                                                                                                             | a == 'y'  = go as t25
                                                                                                                             | otherwise = go as t26

        go _ _ = False

The full module can be found at this adress : WordTree1.hs

Multibranch custom tree (WordTree2)

This tree is of the same type as the WordTree1 tree except that the order of the trees corresponds to the order of the most commonly used letters in the French language²: eairsntolucmpdghbfvyqkzx

member v0 tree = go v0 tree
    where
        go _  Tip = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24 t25) | a == 'e'  = go as t1
                                                                                                                         | a == 'a'  = go as t2
                                                                                                                         | a == 'i'  = go as t3
                                                                                                                         | a == 'r'  = go as t4
                                                                                                                         | a == 's'  = go as t5
                                                                                                                         | a == 'n'  = go as t6
                                                                                                                         | a == 't'  = go as t7
                                                                                                                         | a == 'o'  = go as t8
                                                                                                                         | a == 'l'  = go as t9
                                                                                                                         | a == 'u'  = go as t10
                                                                                                                         | a == 'c'  = go as t11
                                                                                                                         | a == 'm'  = go as t12
                                                                                                                         | a == 'p'  = go as t13
                                                                                                                         | a == 'd'  = go as t14
                                                                                                                         | a == 'g'  = go as t15
                                                                                                                         | a == 'h'  = go as t16
                                                                                                                         | a == 'b'  = go as t17
                                                                                                                         | a == 'f'  = go as t18
                                                                                                                         | a == 'v'  = go as t19
                                                                                                                         | a == 'y'  = go as t20
                                                                                                                         | a == 'q'  = go as t21
                                                                                                                         | a == 'k'  = go as t22
                                                                                                                         | a == 'z'  = go as t23
                                                                                                                         | a == 'x'  = go as t24
                                                                                                                         | otherwise = go as t25

        go _ _ = False

Le module complet se trouve ici : WordTree2.hs

This tree cannot rebalance itself.

Customized multi-branch tree (WordTree2g)

This tree is a modification of the WordTree2 tree with a grouping of some letters in some branches of the tree. This grouping is done according to the occurrences of these letters and done with the elem function.

The grouping of letters will be:

E
A
I
R
S
N
T
O
LU
CMP
DGHB
…

member v0 tree = go v0 tree
    where
        go _  Tip                                                         = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12) | a == 'e'      = go as t1
                                                                     | a == 'a'      = go as t2
                                                                     | a == 'i'      = go as t3
                                                                     | a == 'r'      = go as t4
                                                                     | a == 's'      = go as t5
                                                                     | a == 'n'      = go as t6
                                                                     | a == 't'      = go as t7
                                                                     | a == 'o'      = go as t8
                                                                     | elem a "lu"   = go as t9
                                                                     | elem a "cmp"  = go as t10
                                                                     | elem a "dghb" = go as t11
                                                                     | otherwise     = go as t12

        go _ _ = False

The full module can be found at this adress : WordTree2g.hs

Arbre personnalisé multibranche (WordTree3)

This tree is an improvement of the WordTree2 tree. The order of the trees corresponds to the order of the letters most commonly used in the list of words to be filtered³:

member v0 tree = go v0 tree
    where
        go _  Tip = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24) | a == 'e'  = go as t1
                                                                                                                     | a == 't'  = go as t2
                                                                                                                     | a == 'u'  = go as t3
                                                                                                                     | a == 's'  = go as t4
                                                                                                                     | a == 'i'  = go as t5
                                                                                                                     | a == 'a'  = go as t6
                                                                                                                     | a == 'o'  = go as t7
                                                                                                                     | a == 'r'  = go as t8
                                                                                                                     | a == 'n'  = go as t9
                                                                                                                     | a == 'l'  = go as t10
                                                                                                                     | a == 'c'  = go as t11
                                                                                                                     | a == 'd'  = go as t12
                                                                                                                     | a == 'q'  = go as t13
                                                                                                                     | a == 'p'  = go as t14
                                                                                                                     | a == 'f'  = go as t15
                                                                                                                     | a == 'v'  = go as t16
                                                                                                                     | a == 'm'  = go as t17
                                                                                                                     | a == 'x'  = go as t18
                                                                                                                     | a == 'z'  = go as t19
                                                                                                                     | a == 'h'  = go as t20
                                                                                                                     | a == 'g'  = go as t21
                                                                                                                     | a == 'j'  = go as t22
                                                                                                                     | a == 'y'  = go as t23
                                                                                                                     | otherwise = go as t24

        go _ _ = False

The full module can be found at this adress : WordTree3.hs

Customized multi-branch tree (WordTree3g)

This tree is a modification of the WordTree3 tree with a grouping of some letters in some branches of the tree. This grouping is done according to the occurrences of these letters and done with the elem function.

The grouping of letters will be:

E
T
U
R
S
IA
OR
NLC
DQPFVMXZH
…

member v0 tree = go v0 tree
    where
        go _  Tip                                             = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9) | a == 'e'           = go as t1
                                                         | a == 't'           = go as t2
                                                         | a == 'u'           = go as t3
                                                         | a == 's'           = go as t4
                                                         | elem a "ia"        = go as t5
                                                         | elem a "or"        = go as t6
                                                         | elem a "nlc"       = go as t7
                                                         | elem a "dqpfvmxzh" = go as t8
                                                         | otherwise          = go as t9

        go _ _ = False

The full module can be found at this adress : WordTree3g.hs

Test functions

Reading text files

The lists of words to be filtered and used as filters are read from a text file with readFile then split into words, converted to lower case with map (map toLower) . words and returned as a doublet.

This function is executed outside of the test to avoid taking into account the construction time of the of the lists in the test.

readWordsIO :: FilePath -> FilePath -> IO ([String], [String])
readWordsIO pathfil pathtxt = do
    wrdsfil <- map (map toLower) . words <$> readFile pathfil
    wrdstxt <- map (map toLower) . words <$> readFile pathtxt
    return (wrdsfil, wrdstxt)

Filtering

The filtering of the lists will be done with identical functions using the member function function of each module.

Only the list fitting function is a bit different, using notElem.

filterByList (wrdsfilt, wrdstxt) = filter (`notElem` wrdsfilt) wrdstxt

filterTextSet (wrdsfilt, wrdstxt) = filter (\w -> not (Set.member w tree)) wrdstxt
    where tree = Set.fromList wrdsfilt

filterText1 (wrdsfilt, wrdstxt) = filter (\w -> not (WT1.member w tree)) wrdstxt
    where tree = WT1.fromList wrdsfilt

Construction

The construction of the trees will be evaluated simply with the fromList function of each module.

constSet (wrdsfilt, wrdstxt) = Set.fromList wrdsfilt

constTree1 (wrdsfilt, wrdstxt) = WT1.fromList wrdsfilt

Results

Global analysis

The values given by the benchmark are the following:

	Lists	Set	WordTree 1	WordTree 2	WordTree 2g	WordTree 3	WordTree 3g
sec	96,44437	8,73092	7,63710	6,94024	10,84359	6,83521	15,85130
xfaster/List	/	11,0463	12,6284	13,8964	8,8941	14,1099	6,0843
%lower/List	/	-90,95 %	-92,08 %	-92,80 %	-88,76 %	-92,91 %	-83,56%
xfaster/Set	/	/	1,14	1,26	0,81	1,28	0,55
%lower/Set	/	/	-12,53 %	-20,51 %	24,20 %	-21,71 %	81,55 %

We already notice the difference between filtering by list and filtering with trees. The gain is enormous! The processing is 10 times faster than with the list.

We also notice that if the filtering with the Set is very efficient, a filtering with a tree dedicated to each letter is a bit faster.

We can already see that the idea of grouping letters in some branches with the elem function is a bad idea. These functions, although faster than sorting by list, are slower than sorting by Set and especially slower than their non-grouped version.

We can also see that the WordTree2 tree is a bit faster than the WordTree1 tree and that the WordTree3 tree is a bit faster than the WordTree2.

It is therefore interesting to have a tree whose sorting is optimized for the word list used as a filter.

Time to build the trees

As you can see on this summary table, the construction time of a set is much more important than the WordTree1 tree (about 5 times slower). This can be explained very well because Sets are trees that automatically balance themselves during their construction. This requires tests and modifications of the tree which leads to a slower construction.

We also notice that the more we use an adapted (optimized) tree for the words used as a filter, the lower its construction time.

	Set	WordTree 1	WordTree 2	WordTree 2g	WordTree 3	WordTree 3g
sec	0,04696	0,00831	0,00781	0,01192	0,00784	0,01606
xfaster/Set	1,00	5,65	6,02	3,94	5,99	2,92
%lower/Set	/	-82,30 %	-83,38 %	-74,62 %	-83,31 %	-65,80 %

Conclusion

Using a tree filtering is extremely interesting and allows to speed up considerably compared to a filtering by list. The gain is of the order of 90%.

A tree adapted to the words used as a filter also allows to accelerate the processing. The gain being up to up to 20%/25% compared to a Set.

Nevertheless, statistics of the letters used and to implement a tree optimized for this series takes time. series takes time.

For an occasional and standard use, the use of a Set remains the simplest and fastest solution to implement.

For intensive use on a massive amount of data, using a dedicated tree can significantly speed up processing time.

Notes

1.↑

The processing time will increase exponentially.

2.↑

Most used letters in the French language

According to this page, the most common letters used in the French language are:

E 12,80 %
A 8,82 %
I 8,53 %
R 7,72 %
S 7,17 %
N 7,01 %
T 6,63 %
O 6,25 %
L 4,68 %
U 4,16 %
C 3,72 %
M 2,89 %
P 2,46 %
D 2,37 %
G 1,95 %
H 1,93 %
B 1,76 %
F 1,29 %
V 1,19 %
Y 0,72 %
Q 0,54 %
K 0,52 %
Z 0,46 %
X 0,35 %
J 0,25 %

3.↑

A separate analysis showed that the most common letters in the search words are:

\ [('e',133),('s',103),('u',102),('t',90),('i',85),('a',77),('o',61),('n',59),('r',49),('l',46),('c',34),('d',27),('q',21),('p',21),('v',21),('f',19),('m',18),('x',9),('g',9),('z',8),('j',5),('y',4),('h',4),('b',4)] \ [('a',77),('b',4),('c',34),('d',27),('e',133),('f',19),('g',9),('h',4),('i',85),('j',5),('l',46),('m',18),('n',59),('o',61),('p',21),('q',21),('r',49),('s',103),('t',90),('u',102),('v',21),('x',9),('y',4),('z',8)]

E 133
S 103
U 102
T 90
I 85
A 77
O 61
N 59
R 49
L 46
C 34
D 27
Q 21
P 21
V 21
F 19
M 18
X 9
G 9
Z 8
J 5
Y 4
H 4
B 4

Benchmarks